Models to predict length of stay in the emergency department: a systematic literature review and appraisal

Introduction Prolonged Length of Stay (LOS) in the Emergency Department (ED) has been associated with poor clinical outcomes. Prediction of ED LOS may help optimize resource utilization, clinical management, and benchmarking. This study aims to systematically review models for predicting ED LOS and to assess the reporting and methodological quality of these models. Methods The online databases PubMed, Scopus, and Web of Science were searched (10 Sep 2023) for English-language articles that reported prediction models of LOS in the ED. Identified titles and abstracts were independently screened by two reviewers. All original papers describing either development (with or without internal validation) or external validation of a prediction model for LOS in the ED were included. Results Of 12,193 uniquely identified articles, 34 studies were included (29 describe the development of new models and five describe the validation of existing models). Different statistical and machine learning methods were applied in the papers. On the 39-point reporting score and 11-point methodological quality score, the highest reporting scores for development and validation studies were 39 and 8, respectively. Conclusion Various studies on prediction models for ED LOS were published but they are fairly heterogeneous and suffer from methodological and reporting issues. Model development studies were associated with a poor to fair level of methodological quality in terms of the predictor selection approach, the sample size, reproducibility of the results, missing data imputation technique, and avoiding dichotomizing continuous variables. Moreover, it is recommended that future investigators use the confirmed checklist to improve the quality of reporting.


Introduction
Overcrowding in the Emergency Department (ED) is an important worldwide problem [1-3] and it has received considerable international attention in recent years [4-8]. Rising demand for ED services and a relative shortage of hospital beds are major causes of ED crowding and longer waiting times [4]. Length of Stay (LOS) in the ED is usually defined as the time from patient registration in the ED to patient discharge or transfer to another facility or ward [2,9]. ED LOS is perceived as an important component of ED overcrowding [7,9] and a quality indicator for ED throughput [6].
Longer LOS in the ED has been associated with poor clinical outcomes such as increased mortality/morbidity [7] and complication rates, decreased quality of care [1,2] and patient satisfaction, ambulance diversion, and higher levels of recurrent ED crowding [2,3]. Thus, LOS is an important measure of treatment timeliness when correcting for the severity of illness, patient safety, patient satisfaction, and quality of care in the ED [2,6,8,9]. Predicting length of stay is important in clinical and informatics research [10] and for improving ED care and efficiency [3,11]. A model's predicted ED LOS may provide useful information for physicians or patients to better anticipate an individual's LOS and may help administrators plan staffing policy [12]. Additionally, the development of a prediction tool could assist in bed management and patient flow through the ED and hospitals [13].
Many studies have been conducted to develop ED LOS prediction models. However, to the best of our knowledge, no previous systematic literature review has summarized these studies. Given the lack of evidence, additional research is needed to explore the related studies in this area and to address this knowledge gap. This is particularly important considering recent evidence demonstrating the limited implementation, and thus limited impact, of hospital policies to improve patient flow through the ED [10,11].
This study aims to systematically review and appraise the reporting and methodological quality of all development (with or without internal validation) and external validation studies describing a model aimed at predicting LOS in the ED. It also provides recommendations for improving the reporting of prediction models for ED LOS.

Search strategy
We searched the PubMed (Medline), Scopus, and Web of Science databases for journal articles based on keywords in all fields until 10 September 2023, using the following query: ("length of stay") AND (emergency OR urgent) AND (prognostic OR prognosis OR predict*). All references were imported into the literature management program EndNote. All results were screened for relevance against our inclusion and exclusion criteria.
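The Boolean logic of this query can be illustrated with a minimal sketch. The function name `matches_query` and the sample titles are hypothetical, and real database search engines apply their own field rules and indexing, so this only mirrors the three AND-ed groups and the `predict*` prefix wildcard.

```python
import re

# One pattern per AND-ed group of the query; predict* is a prefix wildcard.
PHRASE = re.compile(r"\blength of stay\b", re.IGNORECASE)
SETTING = re.compile(r"\b(emergency|urgent)\b", re.IGNORECASE)
PROGNOSIS = re.compile(r"\b(prognostic|prognosis|predict\w*)\b", re.IGNORECASE)

def matches_query(text: str) -> bool:
    """Return True only if the text satisfies all three groups."""
    return all(p.search(text) for p in (PHRASE, SETTING, PROGNOSIS))

# Hypothetical titles, for illustration only.
print(matches_query("Predicting length of stay in the emergency department"))
print(matches_query("Length of stay after elective surgery"))
```

A record is retained only when every group matches, which is why such a broad query returns many records that must then be screened manually.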

Inclusion and exclusion criteria
All original papers were included if they described either the development (with or without internal validation) or external validation of a prediction model for LOS in emergency department patients. All duplicate articles, conference abstracts, and reviews were excluded. Only English articles were included. The review follows the 2020 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines recommended by the Cochrane Handbook for Systematic Reviews of Interventions [13].

Selection of studies
Two reviewers (H.K. and R.F.) independently screened the titles and abstracts using the Rayyan research tool. Rayyan supports collaborative work on systematic reviews, making it easy to organize articles and extract data, with blinded screening and automatic removal of duplicates. The results were compared and discussed until a consensus was reached. Discrepancies between the two reviewers were resolved by consensus involving a third reviewer (S.E.). Figure 1 shows the search flowchart.

Assessment of methodological and reporting quality
We used a checklist developed for critical appraisal and data extraction for systematic reviews of prediction modeling studies (CHARMS) [14]. This consists of eleven domains, each containing several (one to six) key items, resulting in a total of 32 key items [14]. We extended this checklist with three additional items taken from a scoring framework for assessing the quality of reporting in prediction model development studies [12] (Table 1). The total number of included key items was 39 for 12 different domains.
We extracted 11 items from the literature to evaluate the methodological quality of model development studies [12,14,49,50] (Table 2).
Each key item was rated as 'yes', 'partly', or 'not' for the reporting as well as for the methodological quality, with a respective score of 2, 1, or 0. We summarized these results to rate the reporting and methodological quality of the model development studies. Table 2 describes the extracted data items to quantify each particular domain of the checklist.
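The rating scheme can be sketched in a few lines. The study names, item names, and ratings below are hypothetical and serve only to show how the 2/1/0 scores add up per study and per item.

```python
# Score attached to each rating, as described in the text.
SCORE = {"yes": 2, "partly": 1, "not": 0}

def study_score(ratings):
    """Sum the item scores of one study (ratings maps item -> rating)."""
    return sum(SCORE[r] for r in ratings.values())

def item_totals(studies):
    """Sum the score of each item over all studies."""
    totals = {}
    for ratings in studies.values():
        for item, rating in ratings.items():
            totals[item] = totals.get(item, 0) + SCORE[rating]
    return totals

# Hypothetical ratings for two studies and three items.
studies = {
    "study_A": {"source_of_data": "yes", "missing_data": "partly", "sample_size": "not"},
    "study_B": {"source_of_data": "yes", "missing_data": "not", "sample_size": "yes"},
}
print(study_score(studies["study_A"]))
print(item_totals(studies))
```

Because each item scores at most 2 per study, the total for a single item summed over 34 studies can be at most 2 × 34 = 68.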

Search results
Online searching resulted in 12,193 articles. Initial screening of titles and abstracts rendered 124 articles for full-text review. Based on the full-text review, 90 articles were excluded because they focused on factors associated with ED LOS, or no prediction model was reported. As shown in Table 3, 34 articles were included for full-text analysis and data extraction. In total, 29 models were developed [15-37, 39, 42, 43, 45, 47, 48] and five studies [40,41,44,46,47] described the validation of the Emergency Severity Index (ESI), Canadian Emergency Department Triage and Acuity Scale (CTAS), or ENP-stream models.

Participants
Only one paper did not report the year of study [45].
Only one study included patients who left the ED against medical advice (including discharge due to critical condition), who were transferred to another hospital, or were discharged from the ED after LOS > 24 h of observation, and/or died in the ED [44]. Other studies did not mention readmissions, transfer from or to another ED/hospital, and patients who did not survive the ED stay.

Candidate predictors
Not all studies reported on the predictor selection strategy. Table 2 shows the number and type of predictors in each model. Predictor variables were mostly measured at admission time or within the first 24 h of admission. Predictors selected for inclusion in modeling may have a large but spurious association with the outcome, which leads to predictor selection bias. Including such predictors increases the likelihood of over-fitting and thus overoptimistic predictions of a model's performance for other individuals [49]. The number of continuous predictors was 0 [24,36,39,41,45,47,48], 1 (age) [17,21,28,35,38,40,42-44,46], 2 [15,20,30], 3 [26,31], 4 [33,34], 7 [29], 8 [25], 9 [19], 10 [27], 11 [37], or 18 [23]. The number of categories of all categorical predictors ranged from 0 to 19. Two studies used cut points to categorize continuous variables [20,39]. Only one study used a logarithmic transformation to bring skewed continuous variables closer to normality [41].
As shown in Table 2, age, gender, acuity level, mode of arrival, patient disposition, and insurance type are important predictors for ED LOS that were used in most studies.

Sample size
The number of registered patients ranged from 100 [42] to over 4 million [16,43] and the number of patients selected for model development or validation was between 42 [42] and 4,645,483 [16] patients.

Missing data
Most studies did not describe the completeness of data and/or handling of missing data. Some studies excluded all missing data for development and validation models. Ignoring the missing data can introduce bias. This is especially problematic when the percentage of missing values per attribute varies considerably [23]. Differences between studies in the amount and type of missing data, and in the methods used to handle this missing data, may markedly influence model development and predictive performance. Only eight studies reported on the percentage of missing values [17,21,23,28,38,42,43,47] and two studies described the handling of missing data [19,22]. Specifically, these studies excluded all missing data for development and validation models.
Eight studies evaluated univariate associations with a prolonged LOS [24, 25, 27-29, 32, 35, 36]. Three studies used all candidate variables. The remaining studies did not mention how the initial set of variables was selected. Further details are shown in Table 2. Also, Table 4 shows the factors analyzed and statistics of the selected studies for this systematic review.
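To make the cost of excluding records with missing data concrete, the following minimal sketch reports the percentage of missing values per attribute and then applies a complete-case filter. The records and attribute names are hypothetical, and `None` is assumed to mark a missing value.

```python
def missing_percent(records, attributes):
    """Percentage of records with a missing (None) value, per attribute."""
    n = len(records)
    return {a: 100.0 * sum(r.get(a) is None for r in records) / n
            for a in attributes}

def complete_cases(records, attributes):
    """Keep only records with no missing value in any listed attribute."""
    return [r for r in records
            if all(r.get(a) is not None for a in attributes)]

# Hypothetical ED records; None marks a missing value.
records = [
    {"age": 71, "acuity": 2, "arrival_mode": "ambulance"},
    {"age": None, "acuity": 3, "arrival_mode": "walk-in"},
    {"age": 45, "acuity": None, "arrival_mode": "walk-in"},
    {"age": 30, "acuity": 4, "arrival_mode": None},
]
attrs = ["age", "acuity", "arrival_mode"]

print(missing_percent(records, attrs))
print(len(complete_cases(records, attrs)), "of", len(records), "records kept")
```

In this toy example each attribute is missing in only a quarter of the records, yet complete-case analysis keeps a single record, which shows why reporting both the missingness and the handling method matters.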

Model performance measures
Fourteen studies reported calibration measures (i.e. the agreement between predictions and observed outcomes) among which six studies used the Hosmer-Lemeshow goodness-of-fit test [17,18,31,34,35,39,48], two studies used the visual inspection of the observed vs. predicted proportions [31,43], five studies used the mean squared error [15,17,19,30,31], one study used the life-table method [34], two studies used calibration plots [17,29], one study used the kappa statistic [45], and one study used the linear regression method to inspect the association of forecasts with the actual outcomes [48]. A total of 13 studies used the Receiver Operating Characteristic (ROC) curve to quantify the discrimination power of the prediction model (i.e. the ability of the model to discriminate between those with and those without the event) [15-18, 22-24, 29, 31, 32, 35, 37, 39]. Nine studies also calculated the sensitivity, specificity, and positive and negative predictive values [15-18, 22, 23, 29, 31, 37]. Note that limited use of the popular performance measures prevents us from integrating the prediction powers of the models.
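As an illustration of the measures named above, the sketch below computes the area under the ROC curve (through its rank interpretation), sensitivity and specificity at a cut-off, and the mean squared error of predicted probabilities. The outcomes, probabilities, and the 0.5 cut-off are hypothetical; the included studies may have computed these quantities differently.

```python
def auc(y_true, y_prob):
    """Probability that a random event case receives a higher prediction
    than a random non-event case (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Classify as event when the prediction is at or above the threshold."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    tn = sum(1 for y, p in pairs if y == 0 and p < threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def mean_squared_error(y_true, y_prob):
    """Average squared difference between outcome and prediction."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Hypothetical data: 1 = prolonged LOS, 0 = not prolonged.
y_true = [1, 1, 0, 0]
y_prob = [0.75, 0.25, 0.5, 0.0]

print(auc(y_true, y_prob))
print(sensitivity_specificity(y_true, y_prob))
print(mean_squared_error(y_true, y_prob))
```

The AUC depends only on the ranking of the predictions, whereas the mean squared error also penalizes probabilities that are far from the observed outcome, so the two can disagree about which model is better.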

Model evaluation
Among development studies, sixteen studies performed internal validation, which used a subset of the training dataset to estimate the model performance (N = 9 split sample and N = 7 cross-validation) [15-19, 22-24, 28-32, 43, 45, 48], three studies used the entire dataset for both training and evaluating the model [34,35,39], and twelve studies performed no evaluation approach [20, 21, 25, 26, 33-37, 40, 44, 47]. All six external validation studies assessed the predictive validity of the previously published models by investigating the relationship between scores and ED LOS, mostly using correlation coefficients.
Emergency Severity Index (ESI), Canadian Emergency Department Triage and Acuity Scale (CTAS), Charlson Comorbidity Index (CCI), Korean Triage and Acuity Scale (KTAS), Pronto Atendimento Geriátrico Especializado (ProAGE) and Emergency Nurse Practitioners (ENPs) were six triage instruments that were validated by nine studies to assess these instruments in predicting ED LOS, hospital admission, and number of resources utilized. The results of these studies showed that there was an excellent correlation between the ESI (versions 3 and 4), CTAS, and ENP-streaming and patients' injury severity. The findings of these studies showed that mean LOS was significantly shorter for the patients in the ENP stream in comparison with their counterparts [41]. The mean LOS in the ED was also significantly higher for the patients with higher acuity levels in comparison with the patients with lower acuity levels (257 vs. 143, P < 0.001) [40]. Moreover, the patients with ESI 4-5 and 2-3 had the shortest and longest LOS in the ED, respectively [44,46].
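The two internal validation designs mentioned above differ in how often each record is used for testing. A minimal sketch follows, in which the function names, the 30% test fraction, and the five folds are assumptions chosen for illustration rather than values reported by the included studies.

```python
import random

def split_sample(n, test_fraction=0.3, seed=0):
    """Split-sample validation: one random partition into train and test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def k_fold(n, k=5, seed=0):
    """Cross-validation: each record is in the test set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

train, test = split_sample(10)
print(len(train), len(test))
for train_idx, test_idx in k_fold(10, k=5):
    print(sorted(test_idx))
```

With a split sample, performance is estimated on a single held-out part, so small datasets give unstable estimates; cross-validation reuses every record for testing once, at the cost of fitting the model k times.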

Reporting on the developed model
All studies that developed a new model (n = 29) reported the final model. However, since it was not possible to provide a comprehensible representation of the ANN model, only the relative importance of each variable was estimated by counting the number of times each variable was selected as one of the top five variables by each NN in the ensemble. An ensemble is a 'committee' of neural networks that usually outperforms single neural networks [45]. Six studies reported the regression coefficients [22,29,30,38,39,43] and eleven studies were reproducible, since the final model, initial predictors, and final set of variables included in the model were reported [16-19, 22, 23, 28, 29, 34, 39, 45, 48].
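The counting procedure described for the ANN ensemble can be sketched as follows. How each network scores its inputs is not described here, so the per-network importance values below are hypothetical, and `k` is set to 2 instead of 5 only to keep the example small.

```python
from collections import Counter

def top_k_counts(importances_per_network, k=5):
    """Count how often each variable ranks among the top k of a network."""
    counts = Counter()
    for importances in importances_per_network:
        top = sorted(importances, key=importances.get, reverse=True)[:k]
        counts.update(top)
    return counts

# Hypothetical importance scores from three networks in an ensemble.
networks = [
    {"age": 0.9, "acuity": 0.8, "arrival_mode": 0.1},
    {"age": 0.2, "acuity": 0.9, "arrival_mode": 0.7},
    {"age": 0.6, "acuity": 0.5, "arrival_mode": 0.4},
]
print(top_k_counts(networks, k=2).most_common())
```

The resulting counts rank variables by how consistently the networks rely on them, but they do not describe the direction or size of an effect, which is why such a summary is harder to reproduce than a table of regression coefficients.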

Reporting and methodological quality assessment score
Table 1 shows domains and (key) items of the used CHARMS [15] checklist accompanied by the reporting and methodological scores used for quality assessment of the studies. The highest possible reporting scores for the development and validation studies were 67 and 43 respectively. The total score per reporting item ranged from 0 to 68, which is the sum of the reporting score [0, 1, 2] over models. The highest methodological score was 8 for development studies and 6 for validation studies. The total score achieved per methodological item (the sum of the methodological scores [0, 1, 2] over models) ranged from 0 to 68.

Discussion
The average length of stay is an increasingly concerning issue and an important index for bed administration, patient care, and consequently benchmarking of emergency departments. Accurate prediction of LOS in the ED will help physicians make informed decisions during risk assessment and patient stratification. This study aimed to quantify the methodological and reporting quality of prediction models which have been developed or externally evaluated to predict the LOS in the ED.
The most important finding of this study is the remarkable differences in methods used for model development, different thresholds used to categorize the dependent variable, and inclusion of different patient groups, which affected the comparability of the models. A total of 34 studies were published from 1994 to 2023 aiming to develop (N = 29) or externally validate (N = 5) the prediction models for LOS in the ED. Different modeling approaches were used to generate the function predicting the outcome. Since the linear regression method is not applicable when the normality assumption is violated, about 44% of the development studies dichotomized the dependent variable using different thresholds and applied the Logistic Regression method. Five studies used different machine learning techniques to predict ED LOS. Of these, Gradient Boosting (GB) in two studies and CAT-Boost and generative adversarial network (GAN) in two other studies had the best results in predicting LOS [17,19,22,23]. In one study Logistic Regression showed better results than machine learning methods [18]. In addition, Logistic Regression still had similar results compared to machine learning approaches.
Two studies used the Coxian phase-type distribution method and quantile regression because the response variable was highly skewed to the left [33,40]. These methods seemed to be useful because, in the emergency setting, we need to examine not only the middle of the distribution but also extreme events. ANN was also used in five studies [15,16,22,37,45]. Among the different types of ANN, the multilayer perceptron (MLP) gave better results than the other type of ANN [37]. This approach has the advantage over Logistic Regression when the relationships between the inputs and the outputs are not straightforwardly expressed in a pre-specified parametric model. However, the lack of model specification and proneness to over-fitting make it difficult to use in clinical and administrative judgments. Tandberg et al. used time series analysis [35]. This approach can be useful when data are repeatedly measured over time. Gill et al. reported that they used the GBM method because it allows for modeling of interactions and nonlinearities within the data and can handle a large number of variables [33]. One study used a decision tree. This method can demonstrate important patterns intuitively, helping the clinician to make sense of potentially complex combinations of factors [28].
About 40% and 33% of the studies reported calibration and discrimination measures for categorized outcomes, respectively. The Hosmer-Lemeshow goodness-of-fit test was the most frequently used test to assess the agreement between predicted probabilities and observed outcomes for categorized outcomes. However, this widely used test has several drawbacks (e.g., poor interpretability and limited power). Moreover, the ROC curve, which is the most popular method to evaluate the discrimination power of prediction models with binary outcomes, was only used in thirteen studies, among which only nine studies calculated the classification-based performance measures (e.g., sensitivity, specificity, etc.). There are numerous traditional and novel performance measures for estimating the prediction power of the models [54] which have been rarely used in both development and evaluation studies.
Patient triage and resource optimization were the most frequently mentioned intended uses of the models in the included studies. Triage is commonly used to rapidly identify the patients who require immediate care and the patients who can wait before being evaluated and treated. Once the LOS is precisely predicted, physicians can perform an informed and accurate risk assessment and consequently patient stratification. This will also help optimize the bed occupation rate as well as resource utilization in crowded EDs [55].
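The effect of the chosen threshold on comparability can be shown with a small sketch. The LOS values and the 4-hour and 6-hour cut-offs are hypothetical and are not taken from any included study.

```python
def dichotomize(los_hours, threshold_hours):
    """Label a stay as prolonged (1) when LOS exceeds the threshold."""
    return [int(x > threshold_hours) for x in los_hours]

def event_rate(labels):
    """Share of stays labelled as prolonged."""
    return sum(labels) / len(labels)

# Hypothetical ED stays in hours.
los_hours = [1.5, 2.0, 3.5, 4.5, 5.0, 6.5, 8.0, 12.0]

for threshold in (4, 6):
    labels = dichotomize(los_hours, threshold)
    print(threshold, "h:", event_rate(labels))
```

The same eight stays give a prolonged-LOS rate of 0.625 at 4 hours and 0.375 at 6 hours, so two logistic models built on different cut-offs predict different outcomes and their performance figures cannot be compared directly.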
Both development and validation studies completely reported the following key items: number and type of predictors, definition of the candidate predictors, time of predictor measurement, number of participants and outcomes/events, and event/(binary) variable ratio, model interpretation, source of data, and sample size.

Limitations and strengths
A strength of our study is that we systematically assessed the studies using a framework published by Moons et al. (CHARMS) [14], extended with additional items from other studies that developed a prediction model [12,56,57], to assess the studies and models on reporting and methodological quality. We included studies that developed prediction models for ED LOS and did not include studies that evaluate whether a specific characteristic influences or is a predictor for ED LOS. Another strength is that this is the first systematic review of ED LOS prediction models for emergency department patients.
Our study has some limitations. First, there exist some prediction models developed for patients with ED LOS which do not meet our inclusion criteria because they only partly addressed the prediction of ED LOS. Second, it is possible that some papers were missed in our review. Third, we limited our search to English-language articles. Fourth, we searched only three databases (PubMed, Scopus, and Web of Science). Our search terms may not have revealed all aspects of the topic.

Implications for clinicians/policymakers/researchers/ model developers
Available prediction models for LOS in the ED have poor to fair levels of methodological and reporting quality, which makes them barely useful for clinical practice and administrative decision making. Many important issues need to be addressed to provide accurate predictions of the LOS in the ED.

Future research
We recommend that all development and validation studies use a clear definition of LOS in the ED. This might be considered as an essential prerequisite for the comparability of the models. Moreover, models that have not been validated may not perform well in practice because of deficiencies in the development methods or because the new sample is too different from the original. Thus, it is highly recommended to evaluate available models on different datasets and update them if required. It should be noted that using the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) checklist can help future investigators to improve the reporting quality and indirectly the methodological quality of prediction model studies.

Conclusion
Various studies on prediction models for ED LOS were published but they are fairly heterogeneous and suffer from methodological and reporting issues. Model development studies were associated with a poor to fair level of methodological quality in terms of the predictor selection approach, the sample size, reproducibility of the results, missing data imputation technique, and avoiding dichotomizing continuous variables. Moreover, it is recommended that future investigators use the confirmed checklist to improve the quality of reporting. Physicians considering using these models to predict ED LOS should interpret them with reservation until a validation study using recent local data has shown that they obtain moderate calibration and produce accurate predictions.

Fig. 1
PRISMA flow diagram of the study screening process

Table 1
Adopted domains and (key) items of the used CHARMS [15] checklist accompanied by the reporting and methodological score per item
a One or more methodological scores are given to this item
b Additional items were added to the checklist from a scoring framework developed for reviewing models to predict mortality in very premature infants [14]

Table 2
Summary of exclusion criteria used to include ED admissions for model development and/or model validation. Information on predictor variables included and/or predictor variables applied in the model which is validated by the included studies

Table 3
Characteristics of the selected studies for the systematic review